Clustering

Author

Sheeba Moghal

Introduction to Clustering

When we talk about clustering, we often talk about the concept of ‘unsupervised learning’ method, that also brings up the concept of ‘supervised learning’ method.

When we talk about supervised machine learning, it is when there is a defined relationship between the independent and dependent variables of X and Y respectively. If it is for numerical inputs, it is regression and if it is for a class variable, then it is classification.

However, data is never straightforward and there are often instances wherein there is no relationship between the each of the variables. During those instances, employing the un-supervised learning method is the best choice. Clustering, is an unsupervised learning technique that looks for data with no labels, and discovers clusters or groups when there is similarity between the data points. This helps in understanding the internal structure of the data and to understand the patterns within the dataset. In the economic dataset, that contains the fiscal and monetary data of BRICS nations, clustering could be useful to understand if there are well-defining indicators that highlight the performance of each nation, let’s say in terms of GDP. For instance, if a set of features are clustering together, it means there are some features that are well-defined to those specific instances. If certain variables are clustering, it means that maybe there is some sort of information that could be explored to ethically understand the contribution of factors.

K-Means Clustering

-talk and introduce about clustering and the types - introduce about k means clustering and what it does, and cite it -

Code
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import warnings

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans


from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
import numpy as np
from sklearn.cluster import AgglomerativeClustering

from scipy.cluster.hierarchy import dendrogram, linkage

As my data consists of a self-defined variable X and Y, we see that for clustering, the dependant variable or Y variable has to be dropped.

Code
stackeddf= pd.read_csv("../data/stackeddf.csv")

stackeddf.head(10)
y = stackeddf['labels']
x = stackeddf.drop(['labels'], axis=1)
x1 = x.drop(['year'], axis=1)


label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

scaler = StandardScaler()
scaled_data = scaler.fit_transform(x1)

Optimal K-Means : Elbow Method

Code
# to ignore future warnings
warnings.filterwarnings("ignore", category=FutureWarning, module="sklearn.cluster._kmeans")


# optimal clustering using the elbow method
wcss = [] 
for i in range(1, 8): 
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(x) 
    wcss.append(kmeans.inertia_)

# Plotting the WCSS values
plt.plot(range(1, 8), wcss, marker='o')
plt.title('Elbow Method for Optimal k')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS') 
plt.show()

In the code above, to avoid the warnings, the warnings module has been used. For k-means clustering, finding the optimal amount of clustering is important for better interpretability, higher performance and for informed decision making. The elbow method is used to find the optimal amount of clusters ‘K’. It uses the WCSS (Within-Cluster Sum of Square), what is calculated using the distance between the points of the cluster centroid and the points of the cluster. So a loop has been created where for each value of k between 1 and 20, we calculate the WCSS amd plot it that resembles an Elbow. As the clusters increase, the WCSS value decreses. The optimal k value is the change in the shape of the point. The optimal amount of cluster here is 6.

Optimal K-Means : Silhoutte Score Method

Code
# silhoutte score to find optimal k clusters

range_clusters = range(2, 8)

silhoutte = []

for n in range_clusters:
    kmeans = KMeans(n_clusters=n, random_state=2339)
    kmeans.fit(x)
    cluster_labels = kmeans.labels_
    silhoutte.append(silhouette_score(x, cluster_labels))

# plotting
plt.plot(range_clusters, silhoutte, 'bx-')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Analysis for Optimal k')
plt.show()

optimal_clusters = range_clusters[np.argmax(silhoutte)]
print(f"Optimal Number of Clusters: {optimal_clusters}")

Optimal Number of Clusters: 6

Finding the optimal k clusters can also be done through the silhouette score method that quantifies how similar the data point is within a cluster, referred to as ‘cohesion’ in comparison to the other clusters called ‘separation’.

The silhouette coefficient or silhouette score kmeans is a measure of how similar a data point is within-cluster (cohesion) compared to other clusters (separation). In the code above, within the self-defined range of k, we calculate the silhouette scors for each iteration and try to find an optimal k that gives the maximum silhoutte scores. As it ranges between -1 and +1, a higher silhoutte scores’ optimal cluster is chosen since it showcases more distinct and well-defined clusters space. Through this as well, we see that optimal k value is 6.

Optimal K-Value Clustering

Without Feature Extraction

Code
k = 6
kmeans = KMeans(n_clusters=k, n_init=10, random_state=2339)
optimal_kmeans = kmeans.fit_predict(x)
optimal_kmeans
array([2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 3, 3, 3, 5, 5, 3, 3, 3, 5,
       3, 5, 3, 3, 3, 5, 3, 5, 3, 3, 5, 5, 5, 5, 5, 5, 5, 5, 3, 5, 2, 2,
       2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 3, 3, 3, 5, 5, 3, 3, 3, 5, 3, 5,
       3, 3, 3, 5, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 3, 3, 3, 5, 5,
       3, 3, 3, 5, 3, 5, 3, 3, 3, 5, 3, 5, 3, 3, 5, 5, 2, 2, 2, 2, 2, 2,
       2, 2, 1, 1, 1, 1, 1, 3, 3, 3, 5, 5, 3, 3, 3, 5, 3, 5, 3, 3, 3, 5,
       3, 5, 3, 3, 5, 5, 5, 5, 5, 5, 5, 5, 4, 3, 5, 4, 4, 0, 2, 4, 0, 0,
       2, 2, 2, 2, 1, 1, 1, 1, 3, 3, 3, 3, 5, 3, 5, 3, 3, 3, 5, 3, 5, 3,
       3, 5, 5, 5, 5, 5, 5, 5, 5, 4, 3, 5, 4, 4, 0, 2, 4, 0, 0],
      dtype=int32)
Code

plt.scatter(x['ex_debt_shocks'], x['gdp_growth'], c=optimal_kmeans, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, marker='X', c='red', label='Centroids')
plt.title('K-Means Clustering Results (6 Clusters)')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.show()

One of my questions that I wanted to ask was about relationship between External Debt Shocks and the GDP growth. External debt shocks and GDP growth have a complex relationship that is influenced by debt levels, composition, and terms, as well as external factors and policy responses. High and unsustainable debt can stifle economic progress, and the mix of concessional and commercial loans influences the outcome. Debt terms, foreign shocks, and a country’s policy reaction are all important considerations. Global economic conditions and country-specific factors such as governance and political stability add to the complexities of this relationship. Hence, one of the ways in which without the influence mapping between the dependant and independant variable, I wanted to find out relationship between External Debt Shocks and GDP Growth. It is seen that there is clustering where certain clusters being exclusive but most clusters overlapping and non-exclusive in nature. Two clusters are properly created, whereas the other two aren’t.

Code

x['Cluster'] = optimal_kmeans

sns.pairplot(x, hue='Cluster', palette='viridis', markers='X')
plt.suptitle('K-Means Clustering Results (6 Clusters)', y=1.02)
plt.show()
/opt/homebrew/Caskroom/miniconda/base/lib/python3.10/site-packages/seaborn/axisgrid.py:123: UserWarning: The figure layout has changed to tight
  self._figure.tight_layout(*args, **kwargs)

If you look at the overall relationships of all the variables with one another, this also gives us an interesting plot to look at.

With Feature Extraction

Feature Extraction is one of the important components for well-defined and better performing clusters, hence an attempt is made to see if they actually help.

Code
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# before pca 

k_means_before = KMeans(n_clusters=6, random_state=42)
optimal_kmeans_before = k_means_before.fit_predict(x)

# after pca 
pca = PCA(n_components=4)
optimal_pca_kmeans = pca.fit_transform(x)

kmeans_after_pca = KMeans(n_clusters=6, random_state=42)
labels_after_pca = kmeans_after_pca.fit_predict(optimal_pca_kmeans)


# after tsne 

tsne = TSNE(n_components=3, perplexity=2, random_state=42)
optima_tsne_kmeans = tsne.fit_transform(x)

kmeans_after_tsne = KMeans(n_clusters=6, random_state=42)
labels_after_tsne = kmeans_after_pca.fit_predict(optima_tsne_kmeans)


silhouette_score_before_pca = silhouette_score(x, optimal_kmeans_before)
print(f"Silhouette Score before PCA: {silhouette_score_before_pca:.4f}")

silhouette_score_after_pca = silhouette_score(optimal_pca_kmeans, labels_after_pca)
print(f"Silhouette Score after PCA: {silhouette_score_after_pca:.4f}")

silhouette_score_after_tsne = silhouette_score(optima_tsne_kmeans, labels_after_tsne)
print(f"Silhouette Score after TSNE: {silhouette_score_after_tsne:.4f}")
Silhouette Score before PCA: 0.4454
Silhouette Score after PCA: 0.4803
Silhouette Score after TSNE: 0.2795
Code

col = ('gdp_growth', 'ex_debt_shocks')
indices = [x.columns.get_loc(c) for c in col]
print(f"The indices of the columns {col} are: {indices}")
The indices of the columns ('gdp_growth', 'ex_debt_shocks') are: [8, 6]

I’m trying to find the indices so that I can compare the exact values for clustering before and after pca

Code
evr = pca.explained_variance_ratio_
cev = np.cumsum(evr)

print("Explained Variance Ratio for Each Component:")
print(evr*100)

# using pca 1 and pca 2 are better
Explained Variance Ratio for Each Component:
[37.66203489 22.03607405 14.10907775  9.93647105]
Code
# Visualize clusters before PCA
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.scatter(x['ex_debt_shocks'], x['gdp_growth'], c=optimal_kmeans_before, cmap='viridis', edgecolor='k')
plt.title('Clusters Before PCA')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')

# Visualize clusters after PCA
plt.subplot(1, 2, 2)
plt.scatter(optimal_pca_kmeans[:, 0], optimal_pca_kmeans[:, 1], c=labels_after_pca, cmap='viridis', edgecolor='k')
plt.title('Clusters After PCA')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')

ax = plt.subplot(1, 3, 3)
ax.scatter(optima_tsne_kmeans[:, 0], optima_tsne_kmeans[:, 1], c=labels_after_tsne, cmap='viridis', edgecolor='k')
ax.set_title('DBSCAN Clustering after t-SNE')
ax.set_xlabel('t-SNE Component 1')
ax.set_ylabel('t-SNE Component 2')
/var/folders/cm/1bq_zvw92w99j_5d1p5jq5v40000gn/T/ipykernel_71191/353519608.py:16: MatplotlibDeprecationWarning: Auto-removal of overlapping axes is deprecated since 3.6 and will be removed two minor releases later; explicitly call ax.remove() as needed.
  ax = plt.subplot(1, 3, 3)
Text(0, 0.5, 't-SNE Component 2')

Here, you can see that using PCA as a feature extraction method has made the clusters more pronounced. You can also see that there now, I can see the association between the GDP growth vs the external debt shocks.

DBSCAN

  • explain what dbscan is

  • what it does

  • how it is found

  • etc

  • why am i using dbscan as an alternative to k means clustering

Optimal Parameter Tuning

Code
for eps in [i/10 for i in range(4, 14)]:
    for min_samples in range(4, 12):
        print("\neps={}".format(eps))
        print("min_samples={}".format(min_samples))
        
        # Apply DBSCAN
        dbscan = DBSCAN(eps=eps, min_samples=min_samples)
        labels = dbscan.fit_predict(x)
        
        # Check if there is only one unique label
        if len(np.unique(labels)) == 1:
            print("Only one cluster found.")
        else:
            # Calculate Silhouette Score
            silh = silhouette_score(x, labels)
        
            # Print cluster information
            print("Clusters present: {}".format(np.unique(labels)))
            print("Cluster sizes: {}".format(np.bincount(labels + 1)))
            print("Silhouette Score: {}".format(silh*100))

eps=0.4
min_samples=4
Clusters present: [-1  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22
 23 24 25 26 27 28 29 30 31 32 33]
Cluster sizes: [40  5  5  5  5  4  4  4  4  5  5  5  5  4  4  4  5  4  4  5  5  5  5  5
  5  5  5  5  5  4  4  4  4  4  4]
Silhouette Score: 68.88606121804806

eps=0.4
min_samples=5
Clusters present: [-1  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18]
Cluster sizes: [100   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5
   5   5]
Silhouette Score: 21.230240207684638

eps=0.4
min_samples=6
Only one cluster found.

eps=0.4
min_samples=7
Only one cluster found.

eps=0.4
min_samples=8
Only one cluster found.

eps=0.4
min_samples=9
Only one cluster found.

eps=0.4
min_samples=10
Only one cluster found.

eps=0.4
min_samples=11
Only one cluster found.

eps=0.5
min_samples=4
Clusters present: [-1  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22
 23 24 25 26 27 28 29 30 31 32]
Cluster sizes: [40  5  5  5  5  4  4  4  4  5  5  5  5  4  4  4  5  4  4  5  5  5  5  5
  5  5  5  5  5  4  4  4  4  8]
Silhouette Score: 68.31839030361563

eps=0.5
min_samples=5
Clusters present: [-1  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]
Cluster sizes: [92  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  8]
Silhouette Score: 25.807678123815197

eps=0.5
min_samples=6
Clusters present: [-1  0]
Cluster sizes: [187   8]
Silhouette Score: -8.23369191923385

eps=0.5
min_samples=7
Clusters present: [-1  0]
Cluster sizes: [187   8]
Silhouette Score: -8.23369191923385

eps=0.5
min_samples=8
Clusters present: [-1  0]
Cluster sizes: [187   8]
Silhouette Score: -8.23369191923385

eps=0.5
min_samples=9
Only one cluster found.

eps=0.5
min_samples=10
Only one cluster found.

eps=0.5
min_samples=11
Only one cluster found.

eps=0.6
min_samples=4
Clusters present: [-1  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22
 23 24 25 26 27 28 29 30]
Cluster sizes: [40  5  5  5  5  4  4  4  4  5  5  5  5  4  4  4  5  4  4  5  5 10  5  5
 10  5  5  4  4  4  4  8]
Silhouette Score: 65.84073553600564

eps=0.6
min_samples=5
Clusters present: [-1  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17]
Cluster sizes: [92  5  5  5  5  5  5  5  5  5  5  5 10  5  5 10  5  5  8]
Silhouette Score: 23.36329604000017

eps=0.6
min_samples=6
Clusters present: [-1  0  1  2]
Cluster sizes: [167  10  10   8]
Silhouette Score: -13.975611481889022

eps=0.6
min_samples=7
Clusters present: [-1  0  1  2]
Cluster sizes: [167  10  10   8]
Silhouette Score: -13.975611481889022

eps=0.6
min_samples=8
Clusters present: [-1  0  1  2]
Cluster sizes: [167  10  10   8]
Silhouette Score: -13.975611481889022

eps=0.6
min_samples=9
Clusters present: [-1  0  1]
Cluster sizes: [175  10  10]
Silhouette Score: -6.498751323825856

eps=0.6
min_samples=10
Clusters present: [-1  0  1]
Cluster sizes: [175  10  10]
Silhouette Score: -6.498751323825856

eps=0.6
min_samples=11
Only one cluster found.

eps=0.7
min_samples=4
Clusters present: [-1  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22
 23 24 25 26 27 28 29 30]
Cluster sizes: [34  5  5  5  5  4  4  4  4  5  5  5  5  4  4  4  5  4  4  5  5 10  5  5
 10  9  5  4  4  4  8  6]
Silhouette Score: 68.29355859570751

eps=0.7
min_samples=5
Clusters present: [-1  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18]
Cluster sizes: [82  5  5  5  5  5  5  5  5  5  5  5 10  5  5 10  9  5  8  6]
Silhouette Score: 29.50855051842023

eps=0.7
min_samples=6
Clusters present: [-1  0  1  2  3  4]
Cluster sizes: [152  10  10   9   8   6]
Silhouette Score: -5.322259507870037

eps=0.7
min_samples=7
Clusters present: [-1  0  1  2  3]
Cluster sizes: [158  10  10   9   8]
Silhouette Score: -7.979946624901074

eps=0.7
min_samples=8
Clusters present: [-1  0  1  2  3]
Cluster sizes: [158  10  10   9   8]
Silhouette Score: -7.979946624901074

eps=0.7
min_samples=9
Clusters present: [-1  0  1  2]
Cluster sizes: [166  10  10   9]
Silhouette Score: -2.6518608260479666

eps=0.7
min_samples=10
Clusters present: [-1  0  1]
Cluster sizes: [175  10  10]
Silhouette Score: -6.498751323825856

eps=0.7
min_samples=11
Only one cluster found.

eps=0.8
min_samples=4
Clusters present: [-1  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22
 23 24 25 26 27 28 29]
Cluster sizes: [31  5  5  9  5  4  4  4  5  5  5  5  4  4  4  5  4  4  5  5 10  5  5 10
  9  5  4  4  4  8  9]
Silhouette Score: 67.97599532628603

eps=0.8
min_samples=5
Clusters present: [-1  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18]
Cluster sizes: [75  5  5  9  5  5  5  5  5  5  5  5 10  5  5 10  9  5  8  9]
Silhouette Score: 33.331798939822995

eps=0.8
min_samples=6
Clusters present: [-1  0  1  2  3  4  5]
Cluster sizes: [140   9  10  10   9   8   9]
Silhouette Score: -10.814878219318274

eps=0.8
min_samples=7
Clusters present: [-1  0  1  2  3  4  5]
Cluster sizes: [140   9  10  10   9   8   9]
Silhouette Score: -10.814878219318274

eps=0.8
min_samples=8
Clusters present: [-1  0  1  2  3  4  5]
Cluster sizes: [140   9  10  10   9   8   9]
Silhouette Score: -10.814878219318274

eps=0.8
min_samples=9
Clusters present: [-1  0  1  2  3  4]
Cluster sizes: [148   9  10  10   9   9]
Silhouette Score: -15.092558521289606

eps=0.8
min_samples=10
Clusters present: [-1  0  1]
Cluster sizes: [175  10  10]
Silhouette Score: -6.498751323825856

eps=0.8
min_samples=11
Only one cluster found.

eps=0.9
min_samples=4
Clusters present: [-1  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22
 23 24 25 26 27 28 29]
Cluster sizes: [25  5  5 13  5  4  4  5  5  5  5  4  4  4  5  4  4  5  5 10  5  5 10  9
  5  4  4  4  8  9  6]
Silhouette Score: 70.60602982805487

eps=0.9
min_samples=5
Clusters present: [-1  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]
Cluster sizes: [65  5  5 13  5  5  5  5  5  5  5  5 10  5  5 10  9  5  8  9  6]
Silhouette Score: 39.14394885426615

eps=0.9
min_samples=6
Clusters present: [-1  0  1  2  3  4  5  6]
Cluster sizes: [130  13  10  10   9   8   9   6]
Silhouette Score: -4.885196008483215

eps=0.9
min_samples=7
Clusters present: [-1  0  1  2  3  4  5]
Cluster sizes: [136  13  10  10   9   8   9]
Silhouette Score: -8.854821065134036

eps=0.9
min_samples=8
Clusters present: [-1  0  1  2  3  4  5]
Cluster sizes: [136  13  10  10   9   8   9]
Silhouette Score: -8.854821065134036

eps=0.9
min_samples=9
Clusters present: [-1  0  1  2  3  4]
Cluster sizes: [144  13  10  10   9   9]
Silhouette Score: -13.090075870393386

eps=0.9
min_samples=10
Clusters present: [-1  0  1  2]
Cluster sizes: [162  13  10  10]
Silhouette Score: -10.691496511752485

eps=0.9
min_samples=11
Clusters present: [-1  0]
Cluster sizes: [182  13]
Silhouette Score: 1.83520939806984

eps=1.0
min_samples=4
Clusters present: [-1  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22
 23 24 25 26 27 28]
Cluster sizes: [22  5  5 13  5  4  4  5  5  5  5  4  4  9  5  4  4  5 10  5  5 10  9  5
  4  7  4  8  9  6]
Silhouette Score: 70.44684009705696

eps=1.0
min_samples=5
Clusters present: [-1  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20]
Cluster sizes: [54  5  5 13  5  5  5  5  5  9  5  5 10  5  5 10  9  5  7  8  9  6]
Silhouette Score: 45.76419139526408

eps=1.0
min_samples=6
Clusters present: [-1  0  1  2  3  4  5  6  7  8]
Cluster sizes: [114  13   9  10  10   9   7   8   9   6]
Silhouette Score: 3.752445333473861

eps=1.0
min_samples=7
Clusters present: [-1  0  1  2  3  4  5  6  7]
Cluster sizes: [120  13   9  10  10   9   7   8   9]
Silhouette Score: -0.29349167395571546

eps=1.0
min_samples=8
Clusters present: [-1  0  1  2  3  4  5  6]
Cluster sizes: [127  13   9  10  10   9   8   9]
Silhouette Score: -5.125500206523189

eps=1.0
min_samples=9
Clusters present: [-1  0  1  2  3  4  5]
Cluster sizes: [135  13   9  10  10   9   9]
Silhouette Score: -9.4095139558794

eps=1.0
min_samples=10
Clusters present: [-1  0  1  2]
Cluster sizes: [162  13  10  10]
Silhouette Score: -10.691496511752485

eps=1.0
min_samples=11
Clusters present: [-1  0]
Cluster sizes: [182  13]
Silhouette Score: 1.83520939806984

eps=1.1
min_samples=4
Clusters present: [-1  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22
 23 24 25]
Cluster sizes: [22  5  5 13  5  4  4  5  5  5  5  4  4  9  5  4  4  5 10  5  5 19  5  4
  7  4 23]
Silhouette Score: 65.15087686848791

eps=1.1
min_samples=5
Clusters present: [-1  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17]
Cluster sizes: [54  5  5 13  5  5  5  5  5  9  5  5 10  5  5 19  5  7 23]
Silhouette Score: 41.26431202721848

eps=1.1
min_samples=6
Clusters present: [-1  0  1  2  3  4  5]
Cluster sizes: [114  13   9  10  19   7  23]
Silhouette Score: 2.3340869175401573

eps=1.1
min_samples=7
Clusters present: [-1  0  1  2  3  4  5]
Cluster sizes: [114  13   9  10  19   7  23]
Silhouette Score: 2.3340869175401573

eps=1.1
min_samples=8
Clusters present: [-1  0  1  2  3  4]
Cluster sizes: [121  13   9  10  19  23]
Silhouette Score: -0.4725187065409471

eps=1.1
min_samples=9
Clusters present: [-1  0  1  2  3  4]
Cluster sizes: [121  13   9  10  19  23]
Silhouette Score: -0.4725187065409471

eps=1.1
min_samples=10
Clusters present: [-1  0  1  2  3]
Cluster sizes: [130  13  10  19  23]
Silhouette Score: -3.713156029180751

eps=1.1
min_samples=11
Clusters present: [-1  0  1  2]
Cluster sizes: [143  13  19  20]
Silhouette Score: -7.5376527625622805

eps=1.2
min_samples=4
Clusters present: [-1  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22]
Cluster sizes: [22  5  5 13  5  4  4  5  5  5  5  4 13  5  4  4  5 10  5  5 19  5  8 30]
Silhouette Score: 58.56024918564987

eps=1.2
min_samples=5
Clusters present: [-1  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17]
Cluster sizes: [42  5  5 13  5  5  5  5  5 13  5  5 10  5  5 19  5  8 30]
Silhouette Score: 43.0456281819639

eps=1.2
min_samples=6
Clusters present: [-1  0  1  2  3  4  5]
Cluster sizes: [102  13  13  10  19   8  30]
Silhouette Score: 8.47569693355299

eps=1.2
min_samples=7
Clusters present: [-1  0  1  2  3  4  5]
Cluster sizes: [102  13  13  10  19   8  30]
Silhouette Score: 8.47569693355299

eps=1.2
min_samples=8
Clusters present: [-1  0  1  2  3  4  5]
Cluster sizes: [102  13  13  10  19   8  30]
Silhouette Score: 8.47569693355299

eps=1.2
min_samples=9
Clusters present: [-1  0  1  2  3  4]
Cluster sizes: [110  13  13  10  19  30]
Silhouette Score: 4.73499731426314

eps=1.2
min_samples=10
Clusters present: [-1  0  1  2  3  4]
Cluster sizes: [110  13  13  10  19  30]
Silhouette Score: 4.73499731426314

eps=1.2
min_samples=11
Clusters present: [-1  0  1  2  3]
Cluster sizes: [124  13  13  19  26]
Silhouette Score: -1.5832980660409988

eps=1.3
min_samples=4
Clusters present: [-1  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20]
Cluster sizes: [22  5  5 17  5  4  5  5  5  5  4 23  5  4  4  5  5  5 19  5  8 30]
Silhouette Score: 54.27740934904527

eps=1.3
min_samples=5
Clusters present: [-1  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16]
Cluster sizes: [38  5  5 17  5  5  5  5  5 23  5  5  5  5 19  5  8 30]
Silhouette Score: 41.24269836937728

eps=1.3
min_samples=6
Clusters present: [-1  0  1  2  3  4]
Cluster sizes: [98 17 23 19  8 30]
Silhouette Score: 8.601210924259624

eps=1.3
min_samples=7
Clusters present: [-1  0  1  2  3  4]
Cluster sizes: [98 17 23 19  8 30]
Silhouette Score: 8.601210924259624

eps=1.3
min_samples=8
Clusters present: [-1  0  1  2  3  4]
Cluster sizes: [98 17 23 19  8 30]
Silhouette Score: 8.601210924259624

eps=1.3
min_samples=9
Clusters present: [-1  0  1  2  3]
Cluster sizes: [106  17  23  19  30]
Silhouette Score: 6.333926375765589

eps=1.3
min_samples=10
Clusters present: [-1  0  1  2  3]
Cluster sizes: [106  17  23  19  30]
Silhouette Score: 6.333926375765589

eps=1.3
min_samples=11
Clusters present: [-1  0  1  2  3]
Cluster sizes: [110  17  23  19  26]
Silhouette Score: 4.286156315832633

This is the code I had worked on during one of the customer segmentation projects from real life KPMG dataset that tries to segment customers on basis of their consumption patterns during the RFM analysis. Within a defined range of parameters (eps and minimum samples), it find the optimal values that give the best solution. In this, we see that, with eps of 1.2 and minimum samples of 4, the silhouette score is around 53%.

Optimal DBSCAN Clustering

Code
dbscan_optimal = DBSCAN(eps=1.2, min_samples=4)
labels_optimal_dbscan = dbscan_optimal.fit_predict(scaled_data)
Code
plt.scatter(x['ex_debt_shocks'], x['gdp_growth'], c=labels_optimal_dbscan, cmap='viridis')
plt.title('DBSCAN Clustering (Optimal Parameters)')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()

With Feature Extraction

Code

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # Importing this for 3D scatter plot
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

# Before PCA
dbscan_optimal = DBSCAN(eps=1.2, min_samples=4)
labels_optimal_dbscan = dbscan_optimal.fit_predict(x)

# After PCA
pca1 = PCA(n_components=4)
optimal_pca_kmeans1 = pca1.fit_transform(x)

dbscan_optimal_after_pca = DBSCAN(eps=1.2, min_samples=4)
labels_optimal_dbscan_pca = dbscan_optimal_after_pca.fit_predict(optimal_pca_kmeans1)

# After t-SNE
tsne = TSNE(n_components=3, perplexity=2, random_state=42)
optima_tsne_kmeans2 = tsne.fit_transform(scaled_data)

dbscan_optimal_after_tsne = DBSCAN(eps=1.2, min_samples=4)
labels_optimal_dbscan_tsne = dbscan_optimal_after_tsne.fit_predict(optima_tsne_kmeans2)
Code

# Silhouette scores
silhouette_score_before_pca = silhouette_score(scaled_data, labels_optimal_dbscan)
print(f"Silhouette Score before PCA: {silhouette_score_before_pca:.4f}")

silhouette_score_after_pca = silhouette_score(optimal_pca_kmeans1, labels_optimal_dbscan_pca)
print(f"Silhouette Score after PCA: {silhouette_score_after_pca:.4f}")

silhouette_score_after_tsne = silhouette_score(optima_tsne_kmeans2, labels_optimal_dbscan_tsne)
print(f"Silhouette Score after TSNE: {silhouette_score_after_tsne:.4f}")
Silhouette Score before PCA: 0.5285
Silhouette Score after PCA: 0.5121
Silhouette Score after TSNE: 0.5379
Code
evr1 = pca1.explained_variance_ratio_
print("Explained Variance Ratio for Each Component:")
print(evr1*100)

x.columns
Explained Variance Ratio for Each Component:
[37.66203489 22.03607405 14.10907775  9.93647105]
Index(['year', 'adj_NNI_g', 'adj_savings_fix_cap_GNI', 'adj_savings_edu_GNI',
       'adj_NNS_GNI', 'ex_imp_growth', 'ex_debt_shocks', 'fdi_net_outflows',
       'gdp_growth', 'short_term_debt_tot_reserves', 'lending_interest_rate',
       'life_exp_birth', 'expense_gdp', 'military expenditure', 'Cluster'],
      dtype='object')

Again PCA1 and PCA2 attempt to capture the maximum variance of the dataset in comparison

Code
# Visualization
plt.figure(figsize=(15, 5))

# Before PCA
plt.subplot(1, 3, 1)
plt.scatter(x['ex_debt_shocks'], x['gdp_growth'], c=labels_optimal_dbscan, cmap='viridis', edgecolor='k')
plt.title('DBSCAN Clustering before PCA')
plt.xlabel('ex_debt_shocks')
plt.ylabel('gdp_growth')

# After PCA
plt.subplot(1, 3, 2)
plt.scatter(optimal_pca_kmeans1[:, 0], optimal_pca_kmeans1[:, 1], c=labels_optimal_dbscan_pca, cmap='viridis', edgecolor='k')
plt.title('DBSCAN Clustering after PCA')

# After t-SNE
ax = plt.subplot(1, 3, 3)
ax.scatter(optima_tsne_kmeans2[:, 0], optima_tsne_kmeans2[:, 1], c=labels_optimal_dbscan_tsne, cmap='viridis', edgecolor='k')
ax.set_title('DBSCAN Clustering after t-SNE')
ax.set_xlabel('t-SNE Component 1')
ax.set_ylabel('t-SNE Component 2')

plt.tight_layout()
plt.show()

If we try to look at the dataset, although the silhouette scores is more for TSNE but if you look at the clusters, you see the clustering is better for DBSCAN after PCA but in terms of Silhouette scores, the answer is different.

Hierarchial Clustering

  • what is hierarchial clustering
  • how does it work
  • what can you do and how is it different in terms of dbscan and kmeans

Finding optimal clusters

  • talk about how when you want to find the optimal clusters, you can look into the dendogram and just do it.
Code
# for agglomerative clustering 

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt

linkage_matrix = linkage(x, method='ward')
# Create subplots
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Dendrogram (Before PCA)
dendrogram(linkage_matrix, ax=axes[0])
axes[0].set_title('Hierarchical Clustering Dendrogram (Before PCA)')
axes[0].set_xlabel('Data Points')
axes[0].set_ylabel('Distance')
Text(0, 0.5, 'Distance')

Code
# Silhouette Score
max_clusters = 9
silhouette_scores = []

for n_clusters in range(2, max_clusters + 1):
    agglomerative = AgglomerativeClustering(n_clusters=n_clusters)
    labels = agglomerative.fit_predict(x)
    silhouette_scores.append(silhouette_score(x, labels))

# Plot the silhouette scores
plt.plot(range(2, max_clusters + 1), silhouette_scores, marker='o')
plt.title('Silhouette Score vs. Number of Clusters')
plt.xlabel('Number of Clusters')
plt.ylabel('Silhouette Score')
plt.show()

We know that the agglomerative clusters is around 6, when the silhouette score is around 0.44. Now we try to do clustering on basis of before and after both the feature extraction methods of t-SNE and PCA.

Code

# Before PCA
hierarchial_optimal = DBSCAN(eps=1.2, min_samples=4)
labels_optimal_hierarcial = hierarchial_optimal.fit_predict(x)

# After PCA
pca2 = PCA(n_components=4)
optimal_pca_kmeans2 = pca2.fit_transform(x)

hierarchal_optimal_after_pca = AgglomerativeClustering(n_clusters=6)
labels_optimal_hierarchal_pca = hierarchal_optimal_after_pca.fit_predict(optimal_pca_kmeans2)

# After t-SNE
tsne2 = TSNE(n_components=3, perplexity=2, random_state=42)
optima_tsne_hierarchial = tsne.fit_transform(x)

hierarchal_optimal_after_tsne = AgglomerativeClustering(n_clusters=6)
labels_optimal_hierarchal_tsne = hierarchal_optimal_after_pca.fit_predict(optima_tsne_hierarchial)
Code
# Silhouette scores
silhouette_score_before_pca1 = silhouette_score(x, labels_optimal_hierarcial)
print(f"Silhouette Score before PCA: {silhouette_score_before_pca:.4f}")

silhouette_score_after_pca1 = silhouette_score(optimal_pca_kmeans2, labels_optimal_hierarchal_pca)
print(f"Silhouette Score after PCA: {silhouette_score_after_pca:.4f}")

silhouette_score_after_tsne1 = silhouette_score(optima_tsne_hierarchial, labels_optimal_hierarchal_tsne)
print(f"Silhouette Score after TSNE: {silhouette_score_after_tsne:.4f}")
Silhouette Score before PCA: 0.5285
Silhouette Score after PCA: 0.5121
Silhouette Score after TSNE: 0.5379

Here again, we see that Silhouette Score after t-SNE performs better. We shall have to check it through visualization.

Code
# Visualize clusters before hierarchical clustering
plt.figure(figsize=(18, 5))

# Before Hierarchical Clustering
plt.subplot(1, 3, 1)
plt.scatter(x.values[:, 0], x.values[:, 1], c=labels_optimal_hierarcial, cmap='viridis', edgecolor='k')
plt.title('Clusters Before Hierarchical Clustering')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')

# After Hierarchical Clustering with PCA
plt.subplot(1, 3, 2)
plt.scatter(optimal_pca_kmeans2[:, 0], optimal_pca_kmeans2[:, 1], c=labels_optimal_hierarchal_pca, cmap='viridis', edgecolor='k')
plt.title('Clusters After Hierarchical Clustering (PCA)')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')

# After Hierarchical Clustering with t-SNE
plt.subplot(1, 3, 3)
plt.scatter(optima_tsne_hierarchial[:, 0], optima_tsne_hierarchial[:, 1], c=labels_optimal_hierarchal_tsne, cmap='viridis', edgecolor='k')
plt.title('Clusters After Hierarchical Clustering (t-SNE)')
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')

plt.tight_layout()
plt.show()

When you look at hierarchial clustering, using the agglomerative clustering technique, we see that the clustering after PCA is better as you can see well-defined clusters that are not overlapping one another. You see that the clustering is better after the PCA as the best feature set is used to compare the dataset.

Code
# Create subplots for before and after PCA
fig, axes = plt.subplots(1, 3, figsize=(12, 6))

# Dendrogram before PCA
dendrogram(linkage(x, method='ward'), ax=axes[0])
axes[0].set_title('Hierarchical Clustering Dendrogram (Before PCA)')
axes[0].set_xlabel('Data Points')
axes[0].set_ylabel('Distance')

# Dendrogram after PCA
dendrogram(linkage(optimal_pca_kmeans2, method='ward'), ax=axes[1])
axes[1].set_title('Hierarchical Clustering Dendrogram (After PCA)')
axes[1].set_xlabel('Data Points')
axes[1].set_ylabel('Distance')

# Dendrogram after PCA
dendrogram(linkage(optima_tsne_hierarchial, method='ward'), ax=axes[2])
axes[2].set_title('Hierarchical Clustering Dendrogram (After t-SNE)')
axes[2].set_xlabel('Data Points')
axes[2].set_ylabel('Distance')

plt.tight_layout()
plt.show()

Sources

clustering

https://www.analyticsvidhya.com/blog/2021/01/in-depth-intuition-of-k-means-clustering-algorithm-in-machine-learning/

https://www.analyticsvidhya.com/blog/2021/05/k-mean-getting-the-optimal-number-of-clusters/#:~:text=The%20value%20of%20the%20silhouette,near%200%20denote%20overlapping%20clusters.